Alluxio+Presto 查询

1、 总述

主机 cpu 内存 磁盘
p1(master) 16 core 64GB 100GB
P2 8 core 32GB 100GB
P2 8 core 32GB 100GB

之前我基于TPC-DS 做了很多轮测试,数据量也分布了几个维度,从结果行看性能提升并不明显,只是单表查询上略有提升,多表基本无变化。结果懒得贴上去了,具体可以看我提的issue。

TPC-DS 典型的单表查询

  • group by 、count
1
select ss_sold_Date_sk,count(*) as cnt from hive.tpcds_bin_partitioned_orc_50.store_sales group by ss_sold_Date_sk
  • group by、 order by、count
1
select ss_sold_Date_sk,count(*) as cnt from hive.tpcds_bin_partitioned_orc_50.store_sales group by ss_sold_Date_sk order by cnt desc,ss_sold_Date_sk limit 10
  • group by 、order by、count 、avg
1
select ss_sold_Date_sk,ss_wholesale_cost,avg(ss_item_sk) as cnt,count(distinct(ss_sales_price)) as avg1 from hive.tpcds_bin_partitioned_orc_50.store_sales group by ss_sold_Date_sk,ss_wholesale_cost  order by cnt desc,ss_sold_Date_sk limit 10

TPC-DS 典型的多表查询

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
select channel, col_name, d_year, d_qoy, i_category, COUNT(*) sales_cnt, SUM(ext_sales_price) sales_amt FROM (
SELECT 'store' as channel, 'ss_promo_sk' col_name, d_year, d_qoy, i_category, ss_ext_sales_price ext_sales_price
FROM store_sales, item, date_dim
WHERE ss_promo_sk IS NULL
AND ss_sold_date_sk=d_date_sk
AND ss_item_sk=i_item_sk
UNION ALL
SELECT 'web' as channel, 'ws_ship_customer_sk' col_name, d_year, d_qoy, i_category, ws_ext_sales_price ext_sales_price
FROM web_sales, item, date_dim
WHERE ws_ship_customer_sk IS NULL
AND ws_sold_date_sk=d_date_sk
AND ws_item_sk=i_item_sk
UNION ALL
SELECT 'catalog' as channel, 'cs_bill_hdemo_sk' col_name, d_year, d_qoy, i_category, cs_ext_sales_price ext_sales_price
FROM catalog_sales, item, date_dim
WHERE cs_bill_hdemo_sk IS NULL
AND cs_sold_date_sk=d_date_sk
AND cs_item_sk=i_item_sk) foo
GROUP BY channel, col_name, d_year, d_qoy, i_category
ORDER BY channel, col_name, d_year, d_qoy, i_category
limit 10

针对tpc-ds的sql,单表查询聚合函数多个联合使用,多表关联太多且又使用复杂的聚合函数。测试下来总是囫囵吞枣,很难侦查到有无alluxio时性能的变化,我觉得需要更细粒度的单表测试。

2、单表单聚合函数测试

表名:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37

```sql
CREATE TABLE `store_sales`(
`ss_sold_time_sk` bigint,
`ss_item_sk` bigint,
`ss_customer_sk` bigint,
`ss_cdemo_sk` bigint,
`ss_hdemo_sk` bigint,
`ss_addr_sk` bigint,
`ss_store_sk` bigint,
`ss_promo_sk` bigint,
`ss_ticket_number` bigint,
`ss_quantity` int,
`ss_wholesale_cost` decimal(7,2),
`ss_list_price` decimal(7,2),
`ss_sales_price` decimal(7,2),
`ss_ext_discount_amt` decimal(7,2),
`ss_ext_sales_price` decimal(7,2),
`ss_ext_wholesale_cost` decimal(7,2),
`ss_ext_list_price` decimal(7,2),
`ss_ext_tax` decimal(7,2),
`ss_coupon_amt` decimal(7,2),
`ss_net_paid` decimal(7,2),
`ss_net_paid_inc_tax` decimal(7,2),
`ss_net_profit` decimal(7,2))
PARTITIONED BY (
`ss_sold_date_sk` bigint)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'alluxio://p1:19998/user/hive/warehouse/tpcds_bin_partitioned_orc_50.db/store_sales'
TBLPROPERTIES (
'transient_lastDdlTime'='1574907136')

聚合查询

每个查询三次,最后一列为平均时间,单位毫秒。

  • presto+hdfs (sum)
1
2
3
20381	19798	19916	20031
19213 19564 19855 19544
20350 20671 19432 20151
  • presto+hdfs+alluxio(sum)
1
2
3
6479	6711	6816	6668
6669 7759 6179 6869
6465 7507 7567 7179
  • presto+hdfs (count)
1
2
3
19920	20434	19285	19879
19583 19369 19004 19318
19988 20062 20294 20114
  • presto+hdfs+alluxio(count)
1
2
3
6194	5739	5638	5857
6291 6482 5496 6089
6275 5651 5440 5788
  • presto+hdfs (count(xxx))
1
2
3
21166	18758	20217	20047
20514 19281 20184 19993
20574 19703 19994 20090
  • presto+hdfs+alluxio (count(xxx))
1
2
3
6203	6268	5857	6109
6484 6443 5758 6228
5432 6173 5811 5805
  • presto+hdfs (order by)
1
2
3
22183	20706	21496	21461
21398 20386 20828 20870
20357 20878 21721 20985
  • presto+hdfs+alluxio (order by )
1
2
3
14981	15045	14967	14997
14141 14594 14949 14561
14379 15013 14243 14545
  • presto+hdfs (avg)
1
2
3
21916	20552	19876	22114
19675 19231 20083 19663
20904 19721 20155 20260
  • presto+hdfs+alluxio (avg)
1
2
3
6555	5928	6317	6266
6514 6161 6255 6310
5856 6365 6586 6269

3、总结

单表常规聚合, 有alluxio性能大约2-3倍的提升,presto对于join的查询不擅长,join做分析性能没什么提升,具体也可以看我提交的issue。另外对于多个聚合函数的测试可能也会干扰测试结果,建议大家单独的测试,alluxio+presto 做adhoc对sql是有要求的,不是有缓存了,性能就提升了,需要仔细甄别。